Exploring Under-appreciated Rewards
Abstract
This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high-dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring only small modifications to the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. The proposed algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences; to our knowledge, this is the first time that a pure RL method has solved addition using only reward feedback.
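The "under-appreciated" criterion can be illustrated with a small sketch. The abstract does not state the exact weighting, so the form used here is an assumption: each sampled action sequence receives a softmax weight over (reward / tau - log-probability), and that weight is added to the usual REINFORCE term. The function names, the temperature `tau`, and the mean-reward baseline are hypothetical choices made for illustration only.

```python
import math


def underappreciated_weights(log_probs, rewards, tau=0.1):
    """Softmax over (reward / tau - log_prob) across sampled sequences.

    Assumed form, not taken from the abstract: a sequence scores high when
    its reward is large relative to the log-probability the policy gives it.
    """
    scores = [r / tau - lp for lp, r in zip(log_probs, rewards)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_weights(log_probs, rewards, tau=0.1):
    """Per-sample multipliers for the gradient of log pi(a_k).

    REINFORCE term with a mean-reward baseline (an assumed baseline)
    plus tau times the exploration weight defined above.
    """
    k = len(rewards)
    baseline = sum(rewards) / k
    explore = underappreciated_weights(log_probs, rewards, tau)
    return [(r - baseline) / k + tau * w for r, w in zip(rewards, explore)]


# Three sampled sequences: the second is unlikely under the policy
# but is the only one that earns a reward.
sample_log_probs = [-1.0, -5.0, -1.0]
sample_rewards = [0.0, 1.0, 0.0]
weights = underappreciated_weights(sample_log_probs, sample_rewards, tau=1.0)
```

In this example the second sequence receives the largest exploration weight, since the policy assigns it low probability despite its reward. In a training loop, each value from `combined_weights` would multiply the gradient of the corresponding sequence's log-probability.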
Similar resources
Improving Policy Gradient by Exploring Under-appreciated Rewards
This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of u...
Rationality in Human Movement.
It long has been appreciated that humans behave irrationally in economic decisions under risk: they fail to objectively consider uncertainty, costs, and rewards and instead exhibit risk-seeking or risk-averse behavior. We hypothesize that poor estimates of motor variability (influenced by motor task) and distorted probability weighting (influenced by relevant emotional processes) contribute to ...
Efficacy of pins and diplomas as a reward for long-term smoking cessation.
AIMS AND BACKGROUND Since 2004, the Antismoking Center of the National Cancer Institute of Milan has rewarded those who have been ex-smokers for longer than a year with a "former smoker" pin and a diploma. We investigated firstly whether these rewards contributed to maintain smoking withdrawal, secondly, which one of these was more appreciated and why, and thirdly, how they may have influenced ...
Exploring Effects of Intrinsic Motivation in Reinforcement Learning Agents
We explore Intrinsic Motivation as a reward framework for learning how to perform complicated tasks. Most reinforcement learning tasks assume the existence of a critic who rewards the agent for its actions. However, taking inspiration from biological agents, we can say that the real critic is the agent itself. We experiment with a model where the rewards are generated by the agent using a proces...
Exploring EFL Learners' Beliefs toward Communicative Language Teaching: A Case Study of Iranian EFL Learners
Although Communicative Language Teaching (CLT) has been widely advocated by a considerable number of applied linguists and English language teachers, its implementation in English as a Foreign Language (EFL) contexts has encountered a number of difficulties. Reviewing the literature suggests that one of the reasons for unsuccessful implementation of CLT may be neglect of learners' beliefs in t...
Publication date: 2017